Active Learning for Logistic Regression

Authors

  • Andrew Ian Schein
  • Lyle H. Ungar
  • Gary Morris
  • S. Ted Sandler
  • Weichen Wu
Abstract

Andrew Ian Schein. Supervisor: Lyle H. Ungar.

Which active learning methods can we expect to yield good performance in learning logistic regression classifiers? Addressing this question is a natural first step in providing robust solutions for active learning across a wide variety of exponential models including maximum entropy, generalized linear, loglinear, and conditional random field models. We extend previous work on active learning using explicit objective functions by developing a framework for implementing a wide class of loss functions for active learning of logistic regression, including variance (A-optimality) and log loss reduction. We then run comparisons against different variations of the most widely used heuristic schemes, query by committee and uncertainty sampling, to discover which methods work best for different classes of problems and why.

Our approach to loss functions for active learning borrows from the field of optimal experimental design in statistics. We exploit several properties of nonlinear regression models that allow computation of the variance of a prediction with respect to the model’s input distribution. The strategy of minimizing prediction variance is referred to as A-optimality. A Taylor series approximation of many loss functions conveniently factorizes into alternative weightings of this variance computation. We investigate squared and log loss within this framework.

Our empirical evaluations are the largest effort to date to evaluate explicit objective function methods in active learning. We employed ten data sets in the evaluation from domains such as image recognition and document classification. The data sets vary in number of categories from 2 to 26 and have as many as 6,191 predictors.
This work establishes the benefits of these often cited (but rarely used) strategies, and counters the claim that experimental design methods are too computationally complex to run on interesting data sets. The two loss functions were the only methods we tested that always performed at least as well as a randomly selected training set.

The same data were used to evaluate several heuristic methods, including uncertainty sampling, heuristic variants of the query by committee method, and a method that maximizes classifier certainty. Uncertainty sampling was tested using two different measures of uncertainty: Shannon entropy and margin size. Margin-based uncertainty sampling was found to be superior; however, both methods perform worse than random sampling at times. We show that these failures to match random sampling can be caused by predictor space regions of varying noise or model mismatch. The various heuristics produced mixed results overall in the evaluation, and it is impossible to select one as particularly better than the others when using classifier accuracy as the sole criterion for performance. Margin sampling is the favored approach when computational time is considered along with accuracy.
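
The two uncertainty measures named in the abstract, Shannon entropy and margin size, can be illustrated with a minimal sketch. It is not the author's implementation: the function names and the toy probability matrix `pool_probs` are assumptions made for illustration, and the class probabilities are taken as given rather than produced by a fitted logistic regression model.

```python
import numpy as np

def entropy_scores(probs):
    """Shannon entropy of each row of class probabilities (higher = more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def margin_scores(probs):
    """Gap between the two largest class probabilities (smaller = more uncertain)."""
    top2 = np.sort(probs, axis=1)[:, -2:]  # two largest values per row, ascending
    return top2[:, 1] - top2[:, 0]

def select_query(probs, method="margin"):
    """Index of the pool example that uncertainty sampling would query next."""
    if method == "entropy":
        return int(np.argmax(entropy_scores(probs)))
    if method == "margin":
        return int(np.argmin(margin_scores(probs)))
    raise ValueError(f"unknown method: {method}")

# Hypothetical predicted class probabilities for three unlabeled pool examples.
pool_probs = np.array([
    [0.45, 0.44, 0.11],  # two classes nearly tied
    [0.40, 0.30, 0.30],  # probability mass spread over all classes
    [0.90, 0.05, 0.05],  # confident prediction
])

margin_choice = select_query(pool_probs, "margin")    # picks the near-tie row
entropy_choice = select_query(pool_probs, "entropy")  # picks the most spread-out row
```

On this toy pool the two measures disagree: margin sampling selects the first example and entropy sampling selects the second, which shows how the variants can behave differently once there are more than two categories.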

Similar resources

A Benchmark and Comparison of Active Learning for Logistic Regression

Various active learning methods based on logistic regression have been proposed. In this paper, we investigate seven state-of-the-art strategies, present an extensive benchmark, and provide a better understanding of their underlying characteristics. Experiments are carried out both on 3 synthetic datasets and 43 real-world datasets, providing insights into the behaviour of these active learning...

Active Learning for Multi-Class Logistic Regression

Which of the many proposed methods for active learning can we expect to yield good performance in learning logistic regression classifiers? In this article, we evaluate different approaches to determine suitable practices. Among our contributions, we test several explicit objective functions for active learning: an empirical consideration lacking in the literature until this point. We develop a...

A-Optimality for Active Learning of Logistic Regression Classifiers

Over the last decade there has been growing interest in pool-based active learning techniques, where instead of receiving an i.i.d. sample from a pool of unlabeled data, a learner may take an active role in selecting examples from the pool. Queries to an oracle (a human annotator in most applications) provide label information for the selected observations, but at a cost. The challenge is to en...

Sample size determination for logistic regression

The problem of sample size estimation is important in medical applications, especially in cases of expensive measurements of immune biomarkers. This paper describes the problem of logistic regression analysis with the sample size determination algorithms, namely the methods of univariate statistics, logistic regression, cross-validation and Bayesian inference. The authors, treating the regr...

Active Learning with Rationales for Text Classification

We present a simple and yet effective approach that can incorporate rationales elicited from annotators into the training of any off-the-shelf classifier. We show that our simple approach is effective for multinomial naïve Bayes, logistic regression, and support vector machines. We additionally present an active learning method tailored specifically for the learning with rationales framework.

Hyperspectral segmentation with active learning

This paper introduces a new supervised Bayesian approach to hyperspectral image segmentation, with two main steps: (a) learning, for each class label, the posterior probability distributions, based on a multinomial logistic regression model; (b) segmenting the hyperspectral image, based on the posterior probability distribution learnt in step (a) and on a multi-level logistic prior encoding the...

Publication date: 2005